Abstract

Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks such as the TRECVID Multimedia Event Detection (MED) competition [1]. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT), to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of the fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on the HMDB [8] action recognition dataset and the CCV [5] video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.


1.  Introduction 

The huge number of videos uploaded and viewed on the Internet makes video analysis a hot topic in the computer vision and multimedia communities. Videos contain rich information which can be represented as motion features (e.g., Space-Time Interest Points (STIP) [10], Dense Trajectories [23]), shape and color features of video frames (e.g., SIFT [12], Color SIFT [21]), and acoustic features (e.g., Mel-Frequency Cepstral Coefficients (MFCC)). However, no individual feature can capture all the information in a video. Even for a single feature, the state-of-the-art methods usually combine multiple descriptors. For example, the STIP [10] feature combines the HOG descriptor for shape information and the HOF descriptor for motion information, and the Dense Trajectories feature [23] integrates the trajectory, HOG, HOF and Motion Boundary Histogram (MBH) descriptors.

In video action recognition and event detection tasks, researchers have developed systems which combine multiple features. While performing action recognition on large-scale video datasets, Reddy and Shah [17] found that combining scene features (e.g., Color SIFT) with motion features (e.g., STIP) is beneficial for analyzing real-life videos from the Internet. As for event detection tasks, reports from the top-performing teams [26, 14, 15] in the TRECVID MED competition show that fusion, either feature-level or decision-level, brings a performance gain to the detection tasks.

Fusion mechanisms can be grouped into two types: feature-level fusion and decision-level fusion. In feature-level fusion, a linear combination of kernel matrices from different features is used to capture the structure of video data [18]. One simple and effective form of feature-level fusion, namely average early fusion, averages multiple kernel matrices and uses the average kernel matrix as the similarity measure for classifier training. The other mechanism, decision-level fusion, applies a classifier to each feature and then fuses the results based on the confidence scores. Lan et al. [9] find that combining the decision values obtained from the kernel matrices of individual features with those obtained from the average distances of all the features yields better performance than using the decision values from the individual features only.

The most widely used decision-level fusion method assigns equal weights to the confidence scores from each feature, which may restrain the overall performance due to the inconsistency and incomparability of confidence scores from different models. Intuitively, in decision-level fusion, different features should have different weights since they may not contribute equally to the final decision. Take complex event detection in the TRECVID MED task as an example. Table 1 shows the Average Precision (AP) of the detection results from the Dense Trajectories, STIP and MFCC features. In this experiment, a $\chi^2$-kernel SVM is used as the classifier. For the event “Birthday party”, the acoustic feature MFCC achieves the best prediction performance, and it is much better than the visual motion features. The reason is that the singing and laughing sounds in a birthday party are well captured by MFCC. In contrast, for the event “Changing a vehicle tire”, acoustic information becomes less discriminative, so MFCC performs worse than the Dense Trajectories feature. The situation of STIP is the same as that of MFCC: it achieves good performance for some events but performs worse for others. This example shows that different features do not contribute equally to the task, and therefore their weights should not be identical.

Event Name                  MFCC     Trajectories   STIP
Birthday party              21.3%    13.1%          7.8%
Changing a vehicle tire     3.9%     21.9%          4.1%
Working a sewing project    11.8%    17.4%          11.8%

Table 1. Average Precision for three different features

Another issue in decision-level fusion is the difference of thresholds among confidence scores from different models. Assume that we retrieve the top 500 videos among 32,000 testing videos according to the confidence scores. Table 2 shows that the thresholds of confidence scores from different models can be very different. For example, the Dense Trajectories feature has a higher threshold than the others, which means that in the prediction using the Dense Trajectories feature, only videos with very high confidence scores should be considered as positive results. Ignoring the difference of thresholds among the predictive results would degrade the discriminative ability of the fusion result.

Event Name                  MFCC     Trajectories   STIP
Birthday party              0.075    0.106          0.075
Changing a vehicle tire     0.059    0.085          0.075
Working a sewing project    0.046    0.091          0.086

Table 2. Thresholds for different models

In this paper, we propose a method for feature fusion, which we name Feature Weighting via Optimal Thresholding (FWOT). As discussed above, the weights and thresholds of multiple features are two factors to be considered for feature fusion. In light of this, the proposed fusion algorithm integrates feature weighting and threshold selection into a joint framework. Our premise is that the weight and threshold of each feature are correlated and that the joint optimization of both makes them mutually beneficial. The optimal weight of a feature depends on the threshold, so that it not only accurately reflects the importance of the feature but is also more suitable for making the classification/detection decision. Inspired by [9], we also include the early fusion result in the decision-level fusion. To the best of our knowledge, this is the first work which optimizes weights and thresholds simultaneously for fusion. Instead of directly solving a non-convex and time-consuming problem, we preset a series of thresholds as candidates, which transforms the problem from detecting the optimal thresholds to selecting the best thresholds from the candidates. Further, to make the algorithm more flexible and robust, we additionally introduce a group of smoothing factors to soften the classification/detection decision from discrete values to a continuous domain. In this way, the algorithm is formulated as a Mixed Integer Program (MIP) problem. As the MIP problem is NP-hard, we relax it to a convex optimization problem, which gives a lower bound of the original MIP problem. We then apply the cutting plane algorithm to efficiently solve the problem with almost linear time complexity. In that way, the optimized weights and thresholds can be obtained.

2.  Related  Work 

Multiple Kernel Learning (MKL) [16] is the most popular way of combining different kernels to utilize the advantages of different features in applications such as visual object classification, object detection and video semantic analysis. Vedaldi et al. [22] use the MKL method to learn the optimal combination of exponential $\chi^2$-kernels of edge features, dense and sparse visual words and feature descriptors at different spatial levels. They successfully trained and tested a detector in a reasonable time and achieved the best performance on the PASCAL VOC 2007 and 2008 benchmarks. However, Gehler and Nowozin have recently pointed out in [4] that MKL may be less competitive than average combination when the individual kernels are already discriminative.

Recently, Yang et al. have proposed a semi-supervised algorithm to fuse the information from multiple features. The experiment shows that it is beneficial to exploit the unlabeled data for multiple feature fusion when the labeled data are few. Ma et al. propose to use multiple features to learn different types of video attributes for event detection. However, the algorithms proposed in [25] and [13] assign an equal weight to different features, even though the features may not be equally important.

Natarajan et al. [14, 15] propose a decision-level fusion method particularly for event detection. The algorithm adaptively fuses multiple features by assigning weights to videos based on the detection thresholds. The adaptive decision-level fusion assigns lower weights to scores that are near the threshold and higher weights to scores that are far away from the threshold. Thresholds are set before the fusion stage. Though it is reasonable to assign weights to features according to the detection threshold, this method depends heavily on the preset detection threshold.
Figure 1. An illustration of our Feature Weighting via Optimal Thresholding (FWOT) fusion method


Figure 2. $f(x) = \tanh(x/a)$ for different values of the parameter $a$

3.  The  Proposed  Approach 

In this section, we first elaborate the formulation of the proposed approach. Then we show the detailed steps to obtain the optimal fusion function. Figure 1 illustrates our FWOT method.

3.1.  Problem  Formulation 

Suppose there are $n$ training videos. We denote each video as a variable $x_m \in \mathbb{R}^d$ ($1 \le m \le n$) and its label as $y_m \in \{-1, +1\}$, where $y_m = +1$ indicates that $x_m$ is a positive exemplar and $y_m = -1$ indicates that $x_m$ is a negative one. Assuming that we have $t$ features, we can train classifiers $f_1(x), f_2(x), \ldots, f_t(x)$ according to the features of the videos. One simple function to combine the confidence scores is

$$f(x) = \sum_{i=1}^{t} w_i \, \mathrm{sgn}(f_i(x) - b_i), \qquad (1)$$

where $w_i$ and $b_i$ are the weight and the threshold for the confidence scores of the $i$-th feature, respectively. The function in (1) indicates that for the $i$-th feature, if the confidence score is above the threshold $b_i$, the video is labeled as $+1$, otherwise $-1$, and the label values are then combined according to the weights $w_i$. However, the $\mathrm{sgn}(\cdot)$ function makes the fusion process inflexible, since videos with confidence scores much higher than the threshold and those with confidence scores only slightly higher than the threshold contribute equally to the fusion result. Instead of using the hard label function $\mathrm{sgn}(\cdot)$, we adopt the $\tanh(\cdot)$ function with a smoothing parameter $a$ to generate soft labels. Figure 2 shows the curves of $\tanh(\cdot)$ when the parameter $a$ is set to different values. It can be seen that as $a$ gets smaller, $\tanh(\cdot)$ tends to provide hard labels like $\mathrm{sgn}(\cdot)$. To fit the model to the data distribution and utilize the training videos adaptively, we take $a_i$ as an optimization variable. Thus the final fusion function can be formulated as

$$f(x) = \sum_{i=1}^{t} w_i \tanh\left(\frac{f_i(x) - b_i}{a_i}\right). \qquad (2)$$
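The difference between the hard fusion in (1) and the soft fusion in (2) can be illustrated with a minimal sketch. The scores, weights, thresholds and smoothing values below are hypothetical numbers chosen only for illustration, and the function names are not from the paper.

```python
import numpy as np

def fuse_hard(scores, w, b):
    """Eq. (1): weighted sum of sign-thresholded per-feature scores."""
    return np.sign(scores - b) @ w

def fuse_soft(scores, w, b, a):
    """Eq. (2): weighted sum of tanh-smoothed, thresholded scores."""
    return np.tanh((scores - b) / a) @ w

# Hypothetical confidence scores: 2 videos x 3 features.
# Video 0 is far above the first threshold, video 1 only barely above it.
scores = np.array([[0.90, 0.20, 0.60],
                   [0.11, 0.20, 0.60]])
w = np.array([0.5, 0.3, 0.2])      # illustrative weights
b = np.array([0.10, 0.10, 0.10])   # illustrative thresholds
a = np.array([0.5, 0.5, 0.5])      # illustrative smoothing parameters

hard = fuse_hard(scores, w, b)     # both videos receive the same fused score
soft = fuse_soft(scores, w, b, a)  # video 0 is ranked above video 1
```

With the hard labels both videos are indistinguishable, whereas the soft labels preserve how far each score lies from its threshold. Shrinking `a` towards zero makes `fuse_soft` approach `fuse_hard`, in line with the behaviour described for Figure 2.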

As the smoothing parameters $a$ are tightly correlated with the thresholds, we formulate the problem as selecting the most appropriate combination of thresholds $b$ and smoothing parameters $a$, based on which the optimal weights $w$ are learned. In particular, after we get the confidence scores for the $i$-th feature, we uniformly sample $s$ confidence scores as threshold candidates, denoted as $b_{i1}, b_{i2}, \ldots, b_{is}$. We also preset $r$ smoothing parameters $a_{i1}, \ldots, a_{ir}$ for each feature. Then we learn the weights $w$ simultaneously based on (2).
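The candidate grid described above can be sketched as follows. This is only one plausible reading: "uniformly sample" is interpreted here as taking evenly spaced entries of the sorted scores, the same smoothing candidates are shared by all features, and the names `threshold_candidates` and `candidate_responses` are hypothetical.

```python
import numpy as np

def threshold_candidates(scores_i, s):
    """Pick s evenly spaced values from the sorted scores of one feature."""
    ordered = np.sort(scores_i)
    idx = np.linspace(0, len(ordered) - 1, num=s).round().astype(int)
    return ordered[idx]

def candidate_responses(scores, s, smoothing):
    """Soft labels for every (threshold, smoothing) candidate pair.

    scores: (n, t) confidence scores of n videos under t features.
    Returns F with F[m, i, j, k] = tanh((f_i(x_m) - b_ij) / a_ik)
    and the threshold candidates B of shape (t, s).
    """
    n, t = scores.shape
    B = np.stack([threshold_candidates(scores[:, i], s) for i in range(t)])
    A = np.asarray(smoothing, dtype=float)                  # (r,)
    diff = scores[:, :, None, None] - B[None, :, :, None]   # (n, t, s, 1)
    F = np.tanh(diff / A[None, None, None, :])              # (n, t, s, r)
    return F, B

# Random scores stand in for real classifier outputs.
rng = np.random.default_rng(0)
scores = rng.random((50, 3))
F, B = candidate_responses(scores, s=5, smoothing=[0.5, 0.6, 0.7, 0.8, 0.9])
```

Precomputing all candidate responses once means that choosing a threshold and smoothing parameter for a feature reduces to picking one entry per feature from this grid.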

To step further, we define a function $\psi : \mathbb{R}^{t \times s \times r} \to \mathbb{R}^t$ as $[\psi(X)]_i = \sum_{j,k} X_{ijk}$, and introduce an indicator matrix $D \in \{0,1\}^{t \times s \times r}$ with $\sum_{j,k} D_{ijk} = 1$ for every $i$, where $D_{ijk} = 1$ indicates that the $j$-th threshold $b_{ij}$ and the $k$-th smoothing parameter $a_{ik}$ are selected for the $i$-th feature's confidence scores. Furthermore, we define a function $g_D : \mathbb{R}^d \to \mathbb{R}^t$ as

$$g_D(x) = \psi(D \odot F(x)), \qquad (3)$$

where $F_{ijk}(x) = \tanh\left(\frac{f_i(x) - b_{ij}}{a_{ik}}\right)$ and $\odot$ is the Hadamard product. Denoting the fusion classifier as $f(x) = w^T g_D(x)$, a straightforward way to learn the weights for different features is to minimize the following risk function:

$$\Omega(\|w\|_p) + C \sum_{m=1}^{n} \mathrm{loss}(-y_m w^T g_D(x_m)), \qquad (4)$$

where $\Omega(\|w\|_p)$ is the regularizer, $\mathrm{loss}(\cdot)$ is a convex loss function, and $C > 0$ is a regularization parameter. Here we use the squared hinge loss and $\Omega(\|w\|_p) = \frac{1}{2}\|w\|_2^2$, so the objective function can be formulated as follows:

$$\min_{D \in \Omega} \; \min_{w, \rho, \xi} \; \frac{1}{2}\|w\|_2^2 + \frac{C}{2} \sum_{m=1}^{n} \xi_m^2 - \rho \qquad (5)$$

$$\text{s.t.} \quad y_m w^T g_D(x_m) \ge \rho - \xi_m, \quad \forall m = 1, \ldots, n,$$

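where $\Omega = \{D \mid D \in \{0,1\}^{t \times s \times r}, \sum_{j,k} D_{ijk} = 1\}$ is the feasible set of the indicator matrix $D$. Denoting the Lagrange multipliers $\lambda_m$ for the inequality constraints of the inner optimization problem as a vector $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_n]^T$, (5) can be solved by its dual:

$$\min_{D \in \Omega} \; \max_{\lambda \in \Lambda} \; -\frac{1}{2} \sum_{m=1}^{n} \sum_{q=1}^{n} \lambda_m \lambda_q y_m y_q k_D(x_m, x_q) - \frac{1}{2C} \lambda^T \lambda, \qquad (6)$$

where $k_D(x_m, x_q) = [g_D(x_m)]^T g_D(x_q)$, and $\Lambda = \{\lambda \mid \lambda_m \ge 0, \sum_{m=1}^{n} \lambda_m = 1\}$ is the domain of the vector $\lambda$.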

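Note that (6) is a Mixed Integer Program (MIP) in which $\Omega$ has exponential size. Following [11, 20], we relax (6) to a convex optimization problem. Next, we show that (6) is lower-bounded by

$$\max_{\lambda \in \Lambda} \; \min_{\mu \in \mathcal{M}} \; -\frac{1}{2} (\lambda \odot y)^T \Big( \sum_{p : D^p \in \Omega} \mu_p K^p + \frac{1}{C} I \Big) (\lambda \odot y), \qquad (7)$$

where $K^p_{mq} = k_{D^p}(x_m, x_q)$, $\mathcal{M} = \{\mu \mid \sum_p \mu_p = 1, \mu_p \ge 0\}$, and $\Lambda = \{\lambda \mid \sum_m \lambda_m = 1, \lambda_m \ge 0\}$.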

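According to the minimax inequality stated in [7], problem (6) is lower-bounded by interchanging $\max_{\lambda \in \Lambda}$ and $\min_{D \in \Omega}$, as

$$\max_{\lambda \in \Lambda} \; \min_{D \in \Omega} \; -\frac{1}{2} \sum_{m=1}^{n} \sum_{q=1}^{n} \lambda_m \lambda_q y_m y_q k_D(x_m, x_q) - \frac{1}{2C} \lambda^T \lambda. \qquad (8)$$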

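By introducing a variable $\theta$, this can be simplified as follows,

$$\max_{\lambda \in \Lambda, \, \theta} \; -\theta \; : \; \theta \ge -S(\lambda, D^p), \; \forall D^p \in \Omega, \qquad (9)$$

where $S(\lambda, D^p) = -\frac{1}{2} \sum_{m=1}^{n} \sum_{q=1}^{n} \lambda_m \lambda_q y_m y_q k_{D^p}(x_m, x_q) - \frac{1}{2C} \lambda^T \lambda$.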

By setting the derivative of the Lagrangian of (9) w.r.t. $\theta$ to zero, we obtain the condition $\sum_p \mu_p = 1$ for the Lagrange multipliers $\mu_p \ge 0$ of the inequality constraints in (9). Let $\mu = [\mu_1, \mu_2, \ldots, \mu_P]^T$ be the vector of the $\mu_p$. Then (9) can be further rewritten as

$$\max_{\lambda \in \Lambda} \; \min_{\mu \in \mathcal{M}} \; \sum_{p : D^p \in \Omega} \mu_p S(\lambda, D^p), \qquad (10)$$

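where $\mathcal{M} = \{\mu \mid \mu_p \ge 0, \sum_p \mu_p = 1\}$ is the domain of the Lagrange multiplier vector $\mu$. Substituting $S(\lambda, D^p)$, and noting that the objective function is concave w.r.t. $\lambda$ and convex w.r.t. $\mu$, we get the objective function as follows:

$$\max_{\lambda \in \Lambda} \; \min_{\mu \in \mathcal{M}} \; -\frac{1}{2} (\lambda \odot y)^T \Big( \sum_{p : D^p \in \Omega} \mu_p K^p + \frac{1}{C} I \Big) (\lambda \odot y), \qquad (11)$$

where $K^p_{mq} = k_{D^p}(x_m, x_q)$, $\mathcal{M} = \{\mu \mid \sum_p \mu_p = 1, \mu_p \ge 0\}$, and $\Lambda = \{\lambda \mid \sum_m \lambda_m = 1, \lambda_m \ge 0\}$. We can then see that (11) is equivalent to (7).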
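The substitution step from (10) to (11) can be written out explicitly. The following is a sketch that takes $S(\lambda, D^p)$ to be the objective of (8) evaluated at $D^p$, and uses only $\sum_p \mu_p = 1$ and $y_m \in \{-1, +1\}$, so that $(\lambda \odot y)^T (\lambda \odot y) = \lambda^T \lambda$:

```latex
\begin{align*}
\sum_{p} \mu_p S(\lambda, D^p)
  &= -\frac{1}{2} \sum_{m,q} \lambda_m \lambda_q y_m y_q \sum_{p} \mu_p K^p_{mq}
     - \frac{1}{2C} \lambda^T \lambda \sum_{p} \mu_p \\
  &= -\frac{1}{2} (\lambda \odot y)^T \Big( \sum_{p} \mu_p K^p \Big) (\lambda \odot y)
     - \frac{1}{2C} (\lambda \odot y)^T (\lambda \odot y) \\
  &= -\frac{1}{2} (\lambda \odot y)^T \Big( \sum_{p} \mu_p K^p + \frac{1}{C} I \Big) (\lambda \odot y).
\end{align*}
```

The last line is the objective of (11), a quadratic form in $\lambda \odot y$ whose kernel is a convex combination of the base matrices $K^p$ plus a ridge term.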

3.2.  Cutting  Plane  Algorithm  for  Optimization 

In (7), $\sum_{p : D^p \in \Omega} \mu_p K^p$ can be learned as a convex combination of $|\Omega|$ base matrices. Each base matrix $K^p$ is generated from the indicator matrix $D^p$, which selects the threshold-smoothing parameter pairs from the preset candidates. We use the cutting plane algorithm [6] to solve this problem efficiently. Our approach generates a pool of threshold-smoothing parameter candidates iteratively with the cutting plane algorithm, which makes the number of base matrices in each iteration much smaller than in the original problem. Thus, we can solve the sub-problem in each iteration efficiently.

The detailed steps to solve problem (7) are described as follows. Denoting the current active set as $\mathcal{C} \subset \Omega$, we first initialize the Lagrange multiplier vector $\lambda$ to be $\frac{1}{n}\mathbf{1}$, where $\mathbf{1}$ indicates a vector of $n$ ones, and find the most violated indicator matrix $D \in \Omega$. In the first iteration, we let the initial active set be $\mathcal{C} = \{D\}$, then transform problem (7) into its primal form and get a new solution of $\lambda$. We continue to find the most violated $D$ and add it into the active set $\mathcal{C}$. We repeat finding the Lagrange multiplier vector $\lambda$ and the most violated indicator matrix $D$ until convergence.

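Assume that in the $P$-th iteration of the cutting plane algorithm, the current active set is $\mathcal{C} = \{D^1, D^2, \ldots, D^P\}$. The problem in (7) then corresponds to the following primal optimization problem:

$$\min_{\mu \in \mathcal{M}, \, \omega, \, \rho, \, \xi} \; \frac{1}{2} \sum_{p=1}^{P} \frac{\|\omega_p\|^2}{\mu_p} + \frac{C}{2} \sum_{m=1}^{n} \xi_m^2 - \rho \qquad (12)$$

$$\text{s.t.} \quad y_m \sum_{p=1}^{P} \omega_p^T g_{D^p}(x_m) \ge \rho - \xi_m, \quad \forall m = 1, \ldots, n,$$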
P=  1 
which can be solved following [16] by alternating two steps: 1) fix $\mu$ and solve the dual of the SVM to update $\lambda$; 2) fix $\lambda$ and use the reduced gradient method to update $\mu$. The complete procedure for solving problem (7) is shown in Algorithm 1. The complexity of our algorithm is the same as that of LIBLINEAR [3], which is very efficient for large-scale data.

Algorithm 1: Feature Weighting via Optimal Thresholding

Initialize $\lambda = \frac{1}{n}\mathbf{1}$, find the most violated $D$, let $\mathcal{C} = \{D\}$;
repeat
    Initialize $\mu = [1]^T$, $P \leftarrow 1$;
    repeat
        Fix $\mu$, solve the dual of SVM as follows to update $\lambda$:
            $\max_{\lambda \in \Lambda} -\frac{1}{2} (\lambda \odot y)^T \big( \sum_{p} \mu_p K^p + \frac{1}{C} I \big) (\lambda \odot y)$;
        Fix $\lambda$, use the reduced gradient method to update $\mu$;
        $P \leftarrow P + 1$;
    until convergence;
    Find the most violated indicator matrix $D$ and make $\mathcal{C} = \mathcal{C} \cup \{D\}$;
until convergence;


3.3.  Finding  the  Most  Violated  Indicator  Matrix  D 

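After updating $\lambda$ and $\mu$, in each iteration we need to solve the following optimization problem to find the most violated $D$:

$$\max_{D \in \Omega} \; \sum_{m=1}^{n} \sum_{q=1}^{n} \lambda_m \lambda_q y_m y_q k_D(x_m, x_q)$$

$$= \max_{D \in \Omega} \; \sum_{m=1}^{n} \sum_{q=1}^{n} \lambda_m \lambda_q y_m y_q [\psi(D \odot F(x_m))]^T \psi(D \odot F(x_q))$$

$$\Rightarrow \max_{D \in \Omega} \; \sum_{m=1}^{n} \sum_{q=1}^{n} \sum_{i,j,k} \big[ (\lambda_m y_m F(x_m)) \odot (\lambda_q y_q F(x_q)) \odot D \big]_{ijk} \qquad (13)$$

Defining a matrix $\Theta = \sum_{m=1}^{n} \sum_{q=1}^{n} \lambda_m \lambda_q y_m y_q (F_m \odot F_q)$, where $F_m = F(x_m)$, we can get the global optimal solution of (13) by setting $D_{ijk}$ to 1 if $\Theta_{ijk}$ is the element with the largest value in the $i$-th row of $\Theta$. Otherwise, we set $D_{ijk}$ to 0.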
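The selection rule above can be sketched in a few lines. The sketch assumes the candidate responses are stored as an array `F` of shape `(n, t, s, r)` (a hypothetical layout, not prescribed by the paper) and uses the fact that the double sum defining $\Theta$ factorizes as $v \odot v$ with $v = \sum_m \lambda_m y_m F_m$, so it can be computed in one pass over the videos.

```python
import numpy as np

def most_violated_indicator(F, lam, y):
    """Pick, per feature, the (threshold, smoothing) pair maximizing Theta.

    F:   (n, t, s, r) candidate responses F_ijk(x_m).
    lam: (n,) Lagrange multipliers; y: (n,) labels in {-1, +1}.
    Returns the one-hot-per-feature indicator D of shape (t, s, r) and Theta.
    """
    # Theta = sum_m sum_q lam_m lam_q y_m y_q (F_m * F_q) equals v * v
    # elementwise, where v = sum_m lam_m y_m F_m.
    v = np.tensordot(lam * y, F, axes=(0, 0))   # (t, s, r)
    theta = v * v
    t = theta.shape[0]
    flat = theta.reshape(t, -1)                 # one row per feature
    best = flat.argmax(axis=1)                  # largest entry in each row
    D = np.zeros_like(flat)
    D[np.arange(t), best] = 1.0
    return D.reshape(theta.shape), theta
```

Because each feature's row is maximized independently, the result is the global optimum of (13) over all indicator matrices with exactly one selected pair per feature; ties are broken by taking the first maximal entry.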

4.  Experiments 

We test our approach on three publicly available datasets: the HMDB action dataset [8], the Columbia Consumer Video (CCV) dataset [5] and the TRECVID MED 2011 dataset [1] (including the DEV-T and DEV-O collections). In the experiments, we use the same pipeline as described in [24] to evaluate the performance of the proposed method on action recognition, video classification and event detection. For the CCV dataset, we use all the acoustic and visual features provided by the authors of [5]. For the MED datasets, we generate the BoWs representation as follows. For visual features, e.g., MoSIFT [2], STIP [10], Dense Trajectories [23] and SIFT [12], we use the same setting as in [13, 26] to generate the 32,768-dimensional BoWs. In addition to visual features, we use 4,096-dimensional MFCC BoWs [26, 15, 19] as the acoustic feature in the event detection experiment.

In the classification process, we adopt LIBSVM to generate the confidence scores from the probability outputs, and a $\chi^2$-kernel is applied to each type of feature. We calculate the $\chi^2$-kernel for each feature as described in [24]. Besides the confidence scores from the basic features, we also use the predictive scores obtained from the average of the kernel matrices to enhance the performance.

We compare the result with state-of-the-art fusion algorithms, including Early Kernel Fusion (EKF) [18], Multiple Kernel Learning (MKL) [16], and LPBoost [4]. Other late fusion methods, such as a linear SVM on top of the normalized decision scores from all the different features, have a similar optimization goal to LPBoost and perform consistently with it. Thus, among the late fusion comparison algorithms, we only report the result of LPBoost. In the multi-class classification task (HMDB), we use LP-$\beta$ [4], a variant of LPBoost designed particularly for the feature combination problem in multi-class classification. On the TRECVID MED DEV-T and DEV-O collections, we additionally compare the result with Adaptive Late Fusion (ALF) [15], which is particularly designed for event detection.

In the stage of presetting threshold-smoothing parameter candidates, we sample every 10 confidence scores as threshold candidates and empirically set the smoothing parameter candidates as $\{0.5, 0.6, \ldots, 0.9\}$. All the parameters in our proposed method and the compared algorithms are selected from $\{10^{-4}, 10^{-2}, \ldots, 10^{4}\}$ according to cross-validation, except the parameter $\nu$ in LPBoost and LP-$\beta$, which is chosen from $\{0.5, 0.6, \ldots, 0.9\}$ as suggested by [4].

4.1.  Experiment  on  HMDB  dataset 

HMDB [8] is a large action recognition dataset recently collected by Kuehne et al. There are 6,766 videos in total from 51 distinct action categories in HMDB. Each category contains at least 101 clips. It is claimed in [8] that it is the largest and perhaps the most realistic available dataset for human action recognition. The huge diversity in visible body parts, camera motion, camera viewpoint, number of people in the action and video quality makes it a very difficult benchmark dataset for the state-of-the-art action recognition algorithms. The recognition accuracy baseline given in [8] is only 20.44% for the HOG/HOF system and 22.83% for the C2 system.

Method                     Mean Accuracy (%)
Dense Trajectories [23]    46.6
EKF [18]                   46.8
MKL [16]                   46.9
LPBoost [4]                47.2
FWOT                       48.9

Table 3. Recognition accuracies on the HMDB dataset [8]. The top row shows the performance of the best individual feature, and the others indicate the performance of fusion methods.

Method         Mean AP (%)
SIFT [12]      52.8
EKF [18]       52.9
MKL [16]       57.1
LPBoost [4]    56.8
FWOT           60.3

Table 4. Mean AP on the CCV dataset [5]. The top row shows the performance of the best individual feature, and the others indicate the performance of fusion methods.

In our experiment, we use the three official standard training/testing splits identified by [8], which contain 70 videos for training and 30 videos for testing in each action. We use four features as basic features, namely MoSIFT, STIP, Dense Trajectories and SIFT. Before the fusion stage, we train a multi-class SVM classifier for each visual feature with the one-vs-all approach. Confidence scores for training videos are obtained by 5-fold cross-validation. After weighted fusion, we choose the action category with the highest confidence score as the predicted result. Results are shown in Table 3, in which we list the performance of the best individual feature, Dense Trajectories, to show the improvement of the fusion methods over the individual feature. The comparison in Table 3 shows that for action recognition in unconstrained videos using the HMDB dataset, our proposed method outperforms the state-of-the-art fusion methods (48.9% versus 47.2% for LPBoost) by appropriately assigning weights to multiple features.

4.2. Experiment on Columbia Consumer Video dataset

For the video classification task, we use the Columbia Consumer Video (CCV) dataset [5] to compare the performance of different fusion methods. The CCV dataset contains 9,317 videos in 20 semantic categories, of which 4,659 videos are used as training data and 4,658 videos are used as testing data. The semantic categories contain events like “baseball” and “parade”, scenes like “beach”, and objects like “cat”. Consumer videos contain very diverse content and have much fewer textual tags and descriptions, which motivates content analysis based on both acoustic and visual features. Since the authors have not provided the original videos of the dataset, we use the three features provided by [5]: STIP features with a 5,000-dimensional BoWs representation, SIFT features extracted every two seconds with a 5,000-dimensional BoWs representation, and MFCC features with a 4,000-dimensional BoWs representation.

Similarly to the experiment on HMDB, we use the $\chi^2$-kernel to train non-linear SVMs and use 5-fold cross-validation to get the decision values for the training data. Mean Average Precision is used as the evaluation metric, as in [5]. In Table 4, we report the experimental results of the different fusion methods, together with the performance of the best individual feature, SIFT. Since the semantic concepts in the CCV dataset are more complex than the simple actions in the HMDB dataset, fusing the scene information (e.g., SIFT) and acoustic information (e.g., MFCC) improves the classification performance. We can see from the table that our proposed method can distinguish among features in different situations, and achieves a clear improvement over the other fusion methods (60.3% versus 57.1% for MKL).

Method                     Mean AP    Mean Pmiss
Dense Trajectories [23]    0.354      0.399
EKF [18]                   0.414      0.358
MKL [16]                   0.412      0.357
LPBoost [4]                0.415      0.365
ALF [15]                   0.437      0.346
FWOT                       0.442      0.338

Table 5. Comparison of Mean Average Precision (AP) and Mean Pmiss@TER=12.5 (Pmiss) of different methods on the MED 2011 DEV-T collection [1]. LOWER Mean Pmiss indicates BETTER performance. The top row shows the performance of the best individual feature.

4.3.  Experiment  on  TRECVID  MED  2011  dataset 

Multimedia Event Detection (MED) [1] is a part of the TRECVID tasks. MED raises a question in the multimedia and computer vision communities: given some descriptions of an event and a set of illustrative video exemplars, could a system detect the occurrence of an event using acoustic and visual information (individually or together)? In 2011, NIST collected a dataset which consists of about 32,000 testing videos from various Internet video hosting sites, namely the DEV-O collection. A dataset which consists of about 9,700 training videos, namely the DEV-T collection, is used as the development dataset for the participants in TRECVID 2011. Detailed information about the DEV-T and DEV-O collections can be found in [1].

MED 11 DEV-O collection: 10 events are used in the DEV-O collection to test the performance of multimedia event detection systems. These events include “Birthday party (BP)”, “Changing a vehicle tire (CaVT)”, “Flash mob gathering (FMG)”, “Getting a vehicle unstuck (GaVU)”, “Grooming an animal (GaA)”, “Making a sandwich (MaS)”, “Parade (PR)”, “Parkour (PK)”, “Repairing an appliance (RaA)”, and “Working on a sewing project (WaSP)”. For each event, 111 to 173 video exemplars are provided. The total duration of the DEV-O collection is about 1,200 hours, which makes it possibly the largest available dataset with meaningful labels for video analysis.

MED 11 DEV-T collection: The DEV-T collection contains 18 events in total. In addition to the 10 events in DEV-O, there are another 8 events in the DEV-T collection, including “Attempting a board trick (AaBT)”, “Feeding an animal (FaA)”, “Landing a fish (LaF)”, “Wedding ceremony (WC)”, “Working on a woodworking project (WoaWP)”, “Making a cake (MaC)”, “Batting a run (BaR)”, and “Assembling a shelter (AaS)”.

Different from the recognition datasets, many videos in the MED 2011 DEV-T and DEV-O collections do not belong to any event; these are called null data. The videos in the DEV-T and DEV-O collections vary hugely in terms of quality, duration, scene and so forth [1], which makes MED a great challenge for content-based video analysis.

In our experiments, all of the positive video exemplars for each
event are used as training data. In the DEV-T collection, we use all
the null videos as negative exemplars. In the DEV-O collection, we
sample 1,000 videos that do not belong to any event as negative
exemplars. For video representation, we use Dense Trajectories,
STIP, TCH [21] and MFCC as the basic features. When detecting an
event, we train a binary χ²-kernel SVM classifier for each feature
to obtain the confidence scores. 5-fold cross-validation is used to
obtain the confidence scores for the training data. To evaluate on
the DEV-T and DEV-O collections, we use two metrics. One is Average
Precision (AP), which is widely used for imbalanced binary
classification problems. The other is the Probability of
Miss-Detection based on the Detection Threshold 12.5, which is the
standard metric used by NIST [1] in MED to evaluate the performance
of a detection system. We denote the second metric as
Pmiss@TER=12.5 for short. Unlike AP, a lower Pmiss@TER=12.5
indicates better performance.
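
The two metrics can be sketched as follows. This is only an
illustrative sketch, not the official NIST scoring tool. It assumes
that labels are 1 for positive videos and 0 for null videos, and that
TER denotes the ratio Pmiss/Pfa, so that Pmiss@TER=12.5 is read off
the operating point where Pmiss = 12.5 x Pfa. The function names, the
linear interpolation between adjacent thresholds, and the toy data
are our own assumptions.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def pmiss_at_ter(scores, labels, ter=12.5):
    """Pmiss where the detection curve crosses Pmiss = ter * Pfa.

    Assumption: TER is the ratio Pmiss / Pfa. Tied scores are
    processed one at a time, which is a simplification.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    misses, false_alarms = n_pos, 0
    # Start with a threshold above every score: Pmiss = 1, Pfa = 0.
    prev_pmiss, prev_gap = 1.0, 1.0
    for _, y in ranked:
        # Lower the threshold just enough to accept one more video.
        if y == 1:
            misses -= 1
        else:
            false_alarms += 1
        pmiss = misses / n_pos
        gap = pmiss - ter * (false_alarms / n_neg)
        if gap <= 0.0:
            # Crossing found: interpolate linearly between the two points.
            t = prev_gap / (prev_gap - gap)
            return prev_pmiss + t * (pmiss - prev_pmiss)
        prev_pmiss, prev_gap = pmiss, gap
    return 0.0


# Toy confidence scores for one event (made-up values).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 0, 1, 0, 0, 1]
toy_ap = average_precision(toy_scores, toy_labels)
toy_pmiss = pmiss_at_ter(toy_scores, toy_labels, ter=12.5)
```

Mean AP and Mean Pmiss are then the averages of these per-event
values over all events in a collection.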

We compare the Mean AP and Mean Pmiss@TER=12.5 of different methods
on the DEV-T collection and the DEV-O collection in Table 5 and
Table 6. We additionally compare our algorithm to Adaptive Late
Fusion (ALF), which was proposed in [15] specifically for event
detection. Our method achieves the best performance on both
collections. Note that in ALF, thresholds are set before the fusion
process, and poorly chosen thresholds lead to weak performance.

Method                     Mean AP    Mean Pmiss
Dense Trajectories [23]    0.240      0.367
EKF [18]                   0.310      0.318
MKL [16]                   0.310      0.307
LPBoost [4]                0.322      0.310
ALF [15]                   0.210      0.359
FWOT                       0.336      0.294

Table 6. Comparison of Mean Average Precision (AP) and Mean
Pmiss@TER=12.5 (Pmiss) of different methods on the MED 2011 DEV-O
collection [1]. LOWER Mean Pmiss indicates BETTER performance. The
top row shows the performance of the best individual feature.

The different performance of ALF on the DEV-T and DEV-O collections
shows that ALF may suffer from the difficulty of choosing a good
detection threshold and can be unstable in the fusion stage. In
contrast, our method learns proper thresholds while learning the
fusion weights, which makes the fusion more robust in the event
detection system. Figure 3 compares the Average Precision and
Pmiss@TER=12.5 of different fusion methods on every event in the
TRECVID MED 11 DEV-O collection. Our fusion method outperforms the
other state-of-the-art fusion algorithms on 8 out of 10 events in
this collection.

5.  Conclusion 

In this paper, we have introduced an approach that leverages
multiple features through decision-level fusion, optimizing the
weights and thresholds applied to the per-feature confidence scores
simultaneously. We formulate the problem as selecting the most
appropriate combination of thresholds and smoothing parameters,
based on which the optimal weights are learned. We first preset many
threshold and smoothing parameter candidates, and then use the
cutting plane algorithm to obtain the optimal weights and
thresholds, which is very efficient even for large-scale problems.
Experiments on the HMDB and CCV datasets show that our approach
outperforms other state-of-the-art methods on action recognition and
consumer video classification. In addition, we achieve the best
performance among the compared fusion methods on the large-scale
TRECVID MED 2011 video dataset (including the DEV-T and DEV-O
collections) under both the Average Precision and Pmiss@TER=12.5
metrics. The experimental results confirm that our method
outperforms the compared fusion methods across these video analysis
tasks.